Abstract 

A search engine usually outputs a list of K web 
pages. The user examines this list, from the first 
web page to the last, and chooses the first attractive 
page. This model of user behavior is known 
as the cascade model. In this paper, we propose 
cascading bandits, a learning variant of the cascade 
model where the objective is to identify the K 
most attractive items. We formulate our problem 
as a stochastic combinatorial partial monitoring 
problem. We propose two algorithms for solving 
it, CascadeUCB1 and CascadeKL-UCB. We also 
prove gap-dependent upper bounds on the regret 
of these algorithms and derive a lower bound on 
the regret in cascading bandits. The lower bound 
matches the upper bound of CascadeKL-UCB up 
to a logarithmic factor. We experiment with our 
algorithms on several problems. The algorithms 
perform surprisingly well even when our modeling 
assumptions are violated. 

1. Introduction 

The cascade model is a popular model of user behavior in 
web search (Craswell et al., 2008). In this model, the user 
is recommended a list of K items, such as web pages. The 
user examines the recommended list from the first item to 
the last, and selects the first attractive item. In web search, 
this is manifested as a click. The items before the first 
attractive item are not attractive, because the user examines 
these items but does not click on them. The items after the 

Proceedings of the 32nd International Conference on Machine 
Learning, Lille, France, 2015. JMLR: W&CP volume 37. 
Copyright 2015 by the author(s). 


KVETON@ADOBE.COM 

SZEPESVA@CS.UALBERTA.CA 

ZHENGWEN@YAHOO-INC.COM 

AZIN.ASHKAN@TECHNICOLOR.COM 


first attractive item are unobserved, because the user never 
examines these items. The optimal list, the list of K items 
that maximizes the probability that the user finds an attractive 
item, consists of the K most attractive items. The cascade model 
is simple but effective in explaining the so-called position 
bias in historical click data (Craswell et al., 2008). Therefore, 
it is a reasonable model of user behavior. 

In this paper, we propose an online learning variant of the 
cascade model, which we refer to as cascading bandits. In 
this model, the learning agent does not know the attraction 
probabilities of items. At time t, the agent recommends to 
the user a list of K items out of L items and then observes 
the index of the item that the user clicks. If the user clicks 
on an item, the agent receives a reward of one. The goal of 
the agent is to maximize its total reward, or equivalently to 
minimize its cumulative regret with respect to the list of K 
most attractive items. Our learning problem can be viewed 
as a bandit problem where the reward of the agent is a part 
of its feedback. But the feedback is richer than the reward. 
Specifically, the agent knows that the items before the first 
attractive item are not attractive. 

We make five contributions. First, we formulate a learning 
variant of the cascade model as a stochastic combinatorial 
partial monitoring problem. Second, we propose two algorithms 
for solving it, CascadeUCB1 and CascadeKL-UCB. 
CascadeUCB1 is motivated by CombUCB1, a computationally 
and sample efficient algorithm for stochastic combinatorial 
semi-bandits (Gai et al., 2012; Kveton et al., 2015). 
CascadeKL-UCB is motivated by KL-UCB and we expect it 
to perform better when the attraction probabilities of items 
are low (Garivier & Cappe, 2011). This setting is common 
in the problems of our interest, such as web search. Third, 
we prove gap-dependent upper bounds on the regret of our 
algorithms. Fourth, we derive a lower bound on the regret 
in cascading bandits. This bound matches the upper bound 
of CascadeKL-UCB up to a logarithmic factor. Finally, we 
experiment with our algorithms on several problems. They 
perform well even when our modeling assumptions are not 
satisfied. 

Our paper is organized as follows. In Section 2, we review 
the cascade model. In Section 3, we introduce our learning 
problem and propose two UCB-like algorithms for solving 
it. In Section 4, we derive gap-dependent upper bounds on 
the regret of CascadeUCB1 and CascadeKL-UCB. In addition, 
we prove a lower bound and discuss how it relates to 
our upper bounds. We experiment with our learning algorithms 
in Section 5. In Section 6, we review related work. 
We conclude in Section 7. 

2. Background 

Web pages in a search engine can be ranked automatically 
by fitting a model of user behavior in web search from 
historical click data (Radlinski & Joachims, 2005; Agichtein 
et al., 2006). The user is typically assumed to scan a list of 
$K$ web pages $A = (a_1, \dots, a_K)$, which we call items. The 
items belong to some ground set $E = \{1, \dots, L\}$, such as 
the set of all web pages. Many models of user behavior in 
web search exist (Becker et al., 2007; Craswell et al., 2008; 
Richardson et al., 2007). Each of them explains the clicks 
of the user differently. We focus on the cascade model. 

The cascade model is a popular model of user behavior in 
web search (Craswell et al., 2008). In this model, the user 
scans a list of $K$ items $A = (a_1, \dots, a_K) \in \Pi_K(E)$ from 
the first item $a_1$ to the last $a_K$, where $\Pi_K(E)$ is the set of 
all $K$-permutations of set $E$. The model is parameterized 
by attraction probabilities $\bar{w} \in [0, 1]^E$. After the user 
examines item $a_k$, the item attracts the user with probability 
$\bar{w}(a_k)$, independently of the other items. If the user is 
attracted by item $a_k$, the user clicks on it and does not examine 
the remaining items. If the user is not attracted by item 
$a_k$, the user examines item $a_{k+1}$. It is easy to see that the 
probability that item $a_k$ is examined is $\prod_{i=1}^{k-1} (1 - \bar{w}(a_i))$, 
and that the probability that at least one item in $A$ is attractive 
is $1 - \prod_{i=1}^{K} (1 - \bar{w}(a_i))$. This objective is maximized 
by the $K$ most attractive items. 
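
The two probabilities above, and the scanning behavior itself, can be made concrete with a small sketch. This is only an illustration under assumptions that are not part of the paper: the function names and the attraction values are hypothetical, and items are keys of a plain dictionary.

```python
import random

def examination_probability(attraction, ranked_list, k):
    """Probability that the item at 1-based position k is examined."""
    prob = 1.0
    for item in ranked_list[:k - 1]:
        prob *= 1.0 - attraction[item]
    return prob

def click_probability(attraction, ranked_list):
    """Probability that at least one item in the list is attractive."""
    none_attractive = 1.0
    for item in ranked_list:
        none_attractive *= 1.0 - attraction[item]
    return 1.0 - none_attractive

def simulate_click(attraction, ranked_list, rng):
    """Scan the list top-down; return the 1-based click position or None."""
    for position, item in enumerate(ranked_list, start=1):
        if rng.random() < attraction[item]:
            return position
    return None

# Hypothetical attraction probabilities for three items.
attraction = {1: 0.5, 2: 0.2, 3: 0.1}
print(examination_probability(attraction, [1, 2, 3], 3))
print(click_probability(attraction, [1, 2, 3]))
print(simulate_click(attraction, [1, 2, 3], random.Random(0)))
```

Because the click probability is a product over the items in the list, it does not depend on their order, whereas the examination probability of a given position does.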

The cascade model assumes that the user clicks on at most 
one item. In practice, the user may click on multiple items. 
The cascade model cannot explain this pattern. Therefore, 
the model was extended in several directions, for instance 
to take into account multiple clicks and the persistence of 
users (Chapelle & Zhang, 2009; Guo et al., 2009a;b). The 
extended models explain click data better than the cascade 
model. Nevertheless, the cascade model is still very attractive, 
because it is simpler and can be reasonably fit to click 
data. Therefore, as a first step towards understanding more 
complex models, we study an online variant of the cascade 
model in this work. 

3. Cascading Bandits 

We propose a learning variant of the cascade model (Section 3.1) 
and two computationally-efficient algorithms for 
solving it (Section 3.2). To simplify exposition, all random 
variables are written in bold. 

3.1. Setting 

We refer to our learning problem as a generalized cascading 
bandit. Formally, we represent the problem by a tuple 
$B = (E, P, K)$, where $E = \{1, \dots, L\}$ is a ground set of 
$L$ items, $P$ is a probability distribution over a unit hypercube 
$\{0, 1\}^E$, and $K \le L$ is the number of recommended 
items. We call the bandit generalized because the form of 
the distribution $P$ has not been specified yet. 

Let $(\mathbf{w}_t)_{t=1}^{n}$ be an i.i.d. sequence of $n$ weights drawn from 
$P$, where $\mathbf{w}_t \in \{0, 1\}^E$ and $\mathbf{w}_t(e)$ is the preference of the 
user for item $e$ at time $t$. That is, $\mathbf{w}_t(e) = 1$ if and only if 
item $e$ attracts the user at time $t$. The learning agent interacts 
with our problem as follows. At time $t$, the agent recommends 
a list of $K$ items $\mathbf{A}_t = (\mathbf{a}_1^t, \dots, \mathbf{a}_K^t) \in \Pi_K(E)$. 
The list is computed from the observations of the agent up 
to time $t$. The user examines the list, from the first item $\mathbf{a}_1^t$ 
to the last $\mathbf{a}_K^t$, and clicks on the first attractive item. If the 
user is not attracted by any item, the user does not click on 
any item. Then time increases to $t + 1$. 

The reward of the agent at time $t$ can be written in several 
forms. For instance, as $\max_k \mathbf{w}_t(\mathbf{a}_k^t)$, which is one when at 
least one item in list $\mathbf{A}_t$ is attractive; or as $f(\mathbf{A}_t, \mathbf{w}_t)$, where: 

$$f(A, w) = 1 - \prod_{k=1}^{K} (1 - w(a_k)),$$

$A = (a_1, \dots, a_K) \in \Pi_K(E)$, and $w \in \{0, 1\}^E$. This latter 
algebraic form is particularly useful in our proofs. 

The agent at time $t$ receives feedback: 

$$\mathbf{C}_t = \arg\min \{1 \le k \le K : \mathbf{w}_t(\mathbf{a}_k^t) = 1\},$$

where we assume that $\arg\min \emptyset = \infty$. The feedback $\mathbf{C}_t$ 
is the click of the user. If $\mathbf{C}_t \le K$, the user clicks on item 
$\mathbf{C}_t$. If $\mathbf{C}_t = \infty$, the user does not click on any item. Since 
the user clicks on the first attractive item in the list, we can 
determine the observed weights of all recommended items 
at time $t$ from $\mathbf{C}_t$. In particular, note that: 

$$\mathbf{w}_t(\mathbf{a}_k^t) = 1\{\mathbf{C}_t = k\}, \quad k = 1, \dots, \min\{\mathbf{C}_t, K\}. \tag{1}$$

We say that item $e$ is observed at time $t$ if $e = \mathbf{a}_k^t$ for some 
$1 \le k \le \min\{\mathbf{C}_t, K\}$. 
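
The relation in (1) between the click and the observed weights can be sketched as follows. The helper names and the example weights are hypothetical and only illustrate the definitions; `math.inf` stands in for the convention that the argmin over the empty set is infinity.

```python
import math

def click_feedback(weights, ranked_list):
    """Return C_t: the 1-based position of the first attractive item,
    or math.inf when no recommended item is attractive."""
    for k, item in enumerate(ranked_list, start=1):
        if weights[item] == 1:
            return k
    return math.inf

def observed_weights(click, ranked_list):
    """Recover the weights of the observed items from the click alone,
    following (1): positions 1..min(C_t, K) are observed."""
    last = min(click, len(ranked_list))
    return {ranked_list[k - 1]: int(click == k) for k in range(1, last + 1)}

# Hypothetical realization of the weights at one time step.
weights = {1: 0, 2: 1, 3: 1, 4: 0}
click = click_feedback(weights, [1, 2, 3])
print(click, observed_weights(click, [1, 2, 3]))
```

In this example item 3 is attractive but unobserved, because the user stops at position 2.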
In the cascade model (Section 2), the weights of the items 
in the ground set $E$ are distributed independently. We also 
make this assumption. 

Assumption 1. The weights $\mathbf{w}$ are distributed as: 

$$P(w) = \prod_{e \in E} P_e(w(e)),$$

where $P_e$ is a Bernoulli distribution with mean $\bar{w}(e)$. 

Under this assumption, we refer to our learning problem as 
a cascading bandit. In this new problem, the weight of any 
item at time $t$ is drawn independently of the weights of the 
other items at that, or any other, time. This assumption has 
profound consequences and leads to a particularly efficient 
learning algorithm in Section 3.2. More specifically, under 
our assumption, the expected reward for list $A \in \Pi_K(E)$, 
the probability that at least one item in $A$ is attractive, can 
be expressed as $\mathbb{E}[f(A, \mathbf{w})] = f(A, \bar{w})$, and depends only 
on the attraction probabilities of individual items in $A$. 

The agent's policy is evaluated by its expected cumulative 
regret: 

$$R(n) = \mathbb{E}\left[\sum_{t=1}^{n} R(\mathbf{A}_t, \mathbf{w}_t)\right],$$

where $R(\mathbf{A}_t, \mathbf{w}_t) = f(A^*, \mathbf{w}_t) - f(\mathbf{A}_t, \mathbf{w}_t)$ is the instantaneous 
stochastic regret of the agent at time $t$ and: 

$$A^* = \arg\max_{A \in \Pi_K(E)} f(A, \bar{w})$$

is the optimal list of items, the list that maximizes the expected 
reward at any time $t$. Since $f$ is invariant to the permutation 
of $A$, there exist at least $K!$ optimal lists. For simplicity of 
exposition, we assume that the optimal solution, as a set, is 
unique. 
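
As a small worked illustration of these definitions, the expected per-step regret of a fixed list can be computed directly from the attraction probabilities. The sketch below assumes 0-based item indices and hypothetical means; it is not part of the paper.

```python
def expected_reward(means, items):
    """f(A, w_bar): probability that at least one item in the list attracts."""
    miss = 1.0
    for e in items:
        miss *= 1.0 - means[e]
    return 1.0 - miss

def optimal_list(means, K):
    """One optimal list: the K items with the largest attraction probabilities."""
    return sorted(range(len(means)), key=lambda e: means[e], reverse=True)[:K]

def expected_regret(means, items):
    """Expected instantaneous regret of recommending `items`."""
    best = optimal_list(means, len(items))
    return expected_reward(means, best) - expected_reward(means, items)

means = [0.5, 0.4, 0.1]  # hypothetical attraction probabilities
print(expected_regret(means, [0, 2]))  # item 2 replaces the optimal item 1
print(expected_regret(means, [1, 0]))  # a permutation of the optimal list
```

Any permutation of the optimal list has zero expected regret, which is the invariance that yields at least $K!$ optimal lists.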


3.2. Algorithms 

We propose two algorithms for solving cascading bandits, 
CascadeUCB1 and CascadeKL-UCB. CascadeUCB1 is motivated 
by UCB1 (Auer et al., 2002) and CascadeKL-UCB is 
motivated by KL-UCB (Garivier & Cappe, 2011). 

The pseudocode of both algorithms is in Algorithm 1. The 
algorithms are similar and differ only in how they estimate 
the upper confidence bound (UCB) $\mathbf{U}_t(e)$ on the attraction 
probability of item $e$ at time $t$. After that, they recommend 
a list of $K$ items with largest UCBs: 

$$\mathbf{A}_t = \arg\max_{A \in \Pi_K(E)} f(A, \mathbf{U}_t). \tag{2}$$


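Algorithm 1 UCB-like algorithm for cascading bandits. 

  // Initialization 
  Observe $\mathbf{w}_0 \sim P$ 
  $\forall e \in E : \mathbf{T}_0(e) \leftarrow 1$ 
  $\forall e \in E : \hat{\mathbf{w}}_1(e) \leftarrow \mathbf{w}_0(e)$ 

  for all $t = 1, \dots, n$ do 
    Compute UCBs $\mathbf{U}_t(e)$ (Section 3.2) 

    // Recommend a list of K items and get feedback 
    Let $\mathbf{a}_1^t, \dots, \mathbf{a}_K^t$ be $K$ items with largest UCBs 
    $\mathbf{A}_t \leftarrow (\mathbf{a}_1^t, \dots, \mathbf{a}_K^t)$ 
    Observe click $\mathbf{C}_t \in \{1, \dots, K, \infty\}$ 

    // Update statistics 
    $\forall e \in E : \mathbf{T}_t(e) \leftarrow \mathbf{T}_{t-1}(e)$ 
    for all $k = 1, \dots, \min\{\mathbf{C}_t, K\}$ do 
      $e \leftarrow \mathbf{a}_k^t$ 
      $\mathbf{T}_t(e) \leftarrow \mathbf{T}_t(e) + 1$ 
      $\hat{\mathbf{w}}_{\mathbf{T}_t(e)}(e) \leftarrow \dfrac{\mathbf{T}_{t-1}(e)\, \hat{\mathbf{w}}_{\mathbf{T}_{t-1}(e)}(e) + 1\{\mathbf{C}_t = k\}}{\mathbf{T}_t(e)}$ 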
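
The loop in Algorithm 1 can be sketched in Python for the CascadeUCB1 variant. This is a minimal simulation under stated assumptions rather than the authors' implementation: the user is simulated from hypothetical Bernoulli means, items are 0-based, the list is ordered by decreasing UCB, and the confidence radius at the first step (where $\log(t-1)$ is undefined) is set to zero.

```python
import math
import random

def cascade_ucb1(means, K, n, seed=0):
    """Run a CascadeUCB1-style loop on simulated Bernoulli weights.

    means: hypothetical attraction probabilities, one per item.
    Returns the list recommended in the last step and the observation counts.
    """
    rng = random.Random(seed)
    L = len(means)

    def draw(e):
        return 1 if rng.random() < means[e] else 0

    # Initialization: one observed sample per item.
    counts = [1] * L
    estimates = [float(draw(e)) for e in range(L)]
    ranked = []
    for t in range(1, n + 1):
        # Radius c_{t-1,s} = sqrt(1.5 * log(t-1) / s); zero at t = 1 by assumption.
        log_term = math.log(t - 1) if t > 1 else 0.0
        ucb = [estimates[e] + math.sqrt(1.5 * log_term / counts[e])
               for e in range(L)]
        ranked = sorted(range(L), key=lambda e: ucb[e], reverse=True)[:K]
        # The simulated user scans the list and clicks on the first attractive item.
        for e in ranked:
            w = draw(e)
            counts[e] += 1
            estimates[e] += (w - estimates[e]) / counts[e]  # running average
            if w == 1:
                break  # items after the click are not observed
    return ranked, counts

ranked, counts = cascade_ucb1([0.9, 0.8, 0.1, 0.1, 0.1], K=2, n=2000)
print(ranked, counts)
```

The running-average update is algebraically the same as the last line of Algorithm 1, and the `break` implements the fact that only positions up to the click are observed.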


Note that $\mathbf{A}_t$ is determined only up to a permutation of the 
items in it. The payoff is not affected by this ordering. But 
the observations are. For now, we leave the order of items 
unspecified and return to it later in our discussions. After 
the user provides feedback $\mathbf{C}_t$, the algorithms update their 
estimates of the attraction probabilities $\bar{w}(e)$ based on (1), 
for all $e = \mathbf{a}_k^t$, where $k \le \mathbf{C}_t$. 

The UCBs are computed as follows. In CascadeUCB1, the 
UCB on the attraction probability of item $e$ at time $t$ is: 

$$\mathbf{U}_t(e) = \hat{\mathbf{w}}_{\mathbf{T}_{t-1}(e)}(e) + c_{t-1, \mathbf{T}_{t-1}(e)},$$

where $\hat{\mathbf{w}}_s(e)$ is the average of $s$ observed weights of item 
$e$, $\mathbf{T}_t(e)$ is the number of times that item $e$ is observed in $t$ 
steps, and: 

$$c_{t,s} = \sqrt{(1.5 \log t) / s}$$

is the radius of a confidence interval around $\hat{\mathbf{w}}_s(e)$ after $t$ 
steps such that $\bar{w}(e) \in [\hat{\mathbf{w}}_s(e) - c_{t,s}, \hat{\mathbf{w}}_s(e) + c_{t,s}]$ holds 
with high probability. In CascadeKL-UCB, the UCB on the 
attraction probability of item $e$ at time $t$ is: 

$$\mathbf{U}_t(e) = \max\{q \in [\hat{\mathbf{w}}_{\mathbf{T}_{t-1}(e)}(e), 1] : \mathbf{T}_{t-1}(e)\, D_{KL}(\hat{\mathbf{w}}_{\mathbf{T}_{t-1}(e)}(e) \,\|\, q) \le \log t + 3 \log \log t\},$$

where $D_{KL}(p \,\|\, q)$ is the Kullback-Leibler (KL) divergence 
between two Bernoulli random variables with means $p$ and 
$q$. Since $D_{KL}(p \,\|\, q)$ is an increasing function of $q$ for $q \ge p$, 
the above UCB can be computed efficiently. 

3.3. Initialization 

Both algorithms are initialized by one sample $\mathbf{w}_0$ from $P$. 
Such a sample can be generated in $O(L)$ steps, by recommending 
each item once as the first item in the list. 
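
Since the KL divergence is increasing in $q$ above the empirical mean, the CascadeKL-UCB index of Section 3.2 can be found by bisection. The following is a sketch under assumptions of our own: the tolerance, the clipping constant `eps`, and the choice to clamp a negative exploration budget (which occurs for very small $t$) to zero are not specified in the paper.

```python
import math

def bernoulli_kl(p, q, eps=1e-12):
    """KL divergence between Bernoulli(p) and Bernoulli(q), clipped away from 0 and 1."""
    p = min(max(p, eps), 1.0 - eps)
    q = min(max(q, eps), 1.0 - eps)
    return p * math.log(p / q) + (1.0 - p) * math.log((1.0 - p) / (1.0 - q))

def kl_ucb_index(estimate, count, t, tol=1e-6):
    """Largest q in [estimate, 1] with count * KL(estimate || q) <= log t + 3 log log t."""
    budget = math.log(t) + 3.0 * math.log(math.log(t)) if t > 1 else 0.0
    budget = max(budget, 0.0)  # assumption: no exploration bonus when the budget is negative
    lo, hi = estimate, 1.0
    while hi - lo > tol:
        mid = (lo + hi) / 2.0
        if count * bernoulli_kl(estimate, mid) <= budget:
            lo = mid  # mid is still feasible, move the lower end up
        else:
            hi = mid
    return lo

print(kl_ucb_index(0.5, 10, 100))
print(kl_ucb_index(0.5, 1000, 100))
```

The index shrinks towards the empirical mean as the number of observations grows, which is the behavior the confidence bound is meant to have.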
4. Analysis 

Our analysis exploits the fact that our reward and feedback 
models are closely connected. More specifically, we show 
in Section 4.1 that the learning algorithm can suffer regret 
only if it recommends suboptimal items that are observed. 
Based on this result, we prove upper bounds on the n-step 
regret of CascadeUCB1 and CascadeKL-UCB (Section 4.2). 
We prove a lower bound on the regret in cascading bandits 
in Section 4.3. We discuss our results in Section 4.4. 

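4.1. Regret Decomposition 

Without loss of generality, we assume that the items in the 
ground set $E$ are sorted in decreasing order of their attraction 
probabilities, $\bar{w}(1) \ge \dots \ge \bar{w}(L)$. In this setting, the 
optimal solution is $A^* = (1, \dots, K)$, and contains the first 
$K$ items in $E$. We say that item $e$ is optimal if $1 \le e \le K$. 
Similarly, we say that item $e$ is suboptimal if $K < e \le L$. 
The gap between the attraction probabilities of suboptimal 
item $e$ and optimal item $e^*$: 

$$\Delta_{e,e^*} = \bar{w}(e^*) - \bar{w}(e) \tag{3}$$

measures the hardness of discriminating the items. Whenever 
convenient, we view an ordered list of items as the set 
of items on that list. 

Our main technical lemma is below. The lemma says that 
the expected value of the difference of the products of random 
variables can be written in a particularly useful form. 

Lemma 1. Let $A = (a_1, \dots, a_K)$ and $B = (b_1, \dots, b_K)$ 
be any two lists of $K$ items from $\Pi_K(E)$ such that $a_i = b_j$ 
only if $i = j$. Let $\mathbf{w} \sim P$ in Assumption 1. Then: 

$$\mathbb{E}\left[\prod_{k=1}^{K} \mathbf{w}(a_k) - \prod_{k=1}^{K} \mathbf{w}(b_k)\right] = \sum_{k=1}^{K} \mathbb{E}\left[\prod_{i=1}^{k-1} \mathbf{w}(a_i)\right] \mathbb{E}[\mathbf{w}(a_k) - \mathbf{w}(b_k)] \left(\prod_{j=k+1}^{K} \mathbb{E}[\mathbf{w}(b_j)]\right).$$

Proof. The claim is proved in Appendix B. ■ 

Let: 

$$\mathcal{H}_t = (\mathbf{A}_1, \mathbf{C}_1, \dots, \mathbf{A}_{t-1}, \mathbf{C}_{t-1}, \mathbf{A}_t) \tag{4}$$

be the history of the learning agent up to choosing $\mathbf{A}_t$, the 
first $t - 1$ observations and $t$ actions. Let $\mathbb{E}_t[\cdot] = \mathbb{E}[\cdot \mid \mathcal{H}_t]$ 
be the conditional expectation given history $\mathcal{H}_t$. We bound 
$\mathbb{E}_t[R(\mathbf{A}_t, \mathbf{w}_t)]$, the expected regret conditioned on history 
$\mathcal{H}_t$, as follows. 

Theorem 1. For any item $e$ and optimal item $e^*$, let: 

$$G_{e,e^*,t} = \{\exists\, 1 \le k \le K \text{ s.t. } \mathbf{a}_k^t = e,\ \pi_t(k) = e^*,\ \mathbf{w}_t(\mathbf{a}_1^t) = \dots = \mathbf{w}_t(\mathbf{a}_{k-1}^t) = 0\} \tag{5}$$

be the event that item $e$ is chosen instead of item $e^*$ at time 
$t$, and that item $e$ is observed. Then there exists a permutation 
$\pi_t$ of optimal items $\{1, \dots, K\}$, which is a deterministic 
function of $\mathcal{H}_t$, such that $\mathbf{U}_t(\mathbf{a}_k^t) \ge \mathbf{U}_t(\pi_t(k))$ for all 
$k$. Moreover: 

$$\mathbb{E}_t[R(\mathbf{A}_t, \mathbf{w}_t)] \le \sum_{e=K+1}^{L} \sum_{e^*=1}^{K} \Delta_{e,e^*}\, \mathbb{E}_t[1\{G_{e,e^*,t}\}]$$

$$\mathbb{E}_t[R(\mathbf{A}_t, \mathbf{w}_t)] \ge \alpha \sum_{e=K+1}^{L} \sum_{e^*=1}^{K} \Delta_{e,e^*}\, \mathbb{E}_t[1\{G_{e,e^*,t}\}],$$

where $\alpha = (1 - \bar{w}(1))^{K-1}$ and $\bar{w}(1)$ is the attraction probability 
of the most attractive item. 

Proof. We define $\pi_t$ as follows. For any $k$, if the $k$-th item 
in $\mathbf{A}_t$ is optimal, we place this item at position $k$, $\pi_t(k) = \mathbf{a}_k^t$. 
The remaining optimal items are positioned arbitrarily. 
Since $A^*$ is optimal with respect to $\bar{w}$, $\bar{w}(\mathbf{a}_k^t) \le \bar{w}(\pi_t(k))$ 
for all $k$. Similarly, since $\mathbf{A}_t$ is optimal with respect to $\mathbf{U}_t$, 
$\mathbf{U}_t(\mathbf{a}_k^t) \ge \mathbf{U}_t(\pi_t(k))$ for all $k$. Therefore, $\pi_t$ is the desired 
permutation. 

The permutation $\pi_t$ reorders the optimal items in a convenient 
way. Since time $t$ is fixed, let $a_k^* = \pi_t(k)$. Then: 

$$\mathbb{E}_t[R(\mathbf{A}_t, \mathbf{w}_t)] = \mathbb{E}_t\left[\prod_{k=1}^{K} (1 - \mathbf{w}_t(\mathbf{a}_k^t)) - \prod_{k=1}^{K} (1 - \mathbf{w}_t(a_k^*))\right].$$

Now we exploit the fact that the entries of $\mathbf{w}_t$ are independent 
of each other given $\mathcal{H}_t$. By Lemma 1, we can rewrite 
the right-hand side of the above equation as: 

$$\sum_{k=1}^{K} \mathbb{E}_t\left[\prod_{i=1}^{k-1} (1 - \mathbf{w}_t(\mathbf{a}_i^t))\right] \mathbb{E}_t[\mathbf{w}_t(a_k^*) - \mathbf{w}_t(\mathbf{a}_k^t)] \times \left(\prod_{j=k+1}^{K} \mathbb{E}_t[1 - \mathbf{w}_t(a_j^*)]\right).$$

Note that $\mathbb{E}_t[\mathbf{w}_t(a_k^*) - \mathbf{w}_t(\mathbf{a}_k^t)] = \Delta_{\mathbf{a}_k^t, a_k^*}$. Furthermore, 
$\prod_{i=1}^{k-1} (1 - \mathbf{w}_t(\mathbf{a}_i^t)) = 1\{G_{\mathbf{a}_k^t, a_k^*, t}\}$ by conditioning on $\mathbf{A}_t$. 
Therefore, we get that $\mathbb{E}_t[R(\mathbf{A}_t, \mathbf{w}_t)]$ is equal to: 

$$\sum_{k=1}^{K} \Delta_{\mathbf{a}_k^t, a_k^*}\, \mathbb{E}_t[1\{G_{\mathbf{a}_k^t, a_k^*, t}\}] \prod_{j=k+1}^{K} \mathbb{E}_t[1 - \mathbf{w}_t(a_j^*)].$$

By definition of $\pi_t$, $\Delta_{\mathbf{a}_k^t, a_k^*} = 0$ when item $\mathbf{a}_k^t$ is optimal. 
In addition, $1 - \bar{w}(1) \le \mathbb{E}_t[1 - \mathbf{w}_t(a_j^*)] \le 1$ for any optimal 
$a_j^*$. Our upper and lower bounds on $\mathbb{E}_t[R(\mathbf{A}_t, \mathbf{w}_t)]$ 
follow from these observations. ■ 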
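
Under Assumption 1, the expectations in Lemma 1 factor into products of means, so the lemma reduces to a telescoping identity that is easy to check numerically. The sketch below assumes hypothetical mean vectors for the two lists and requires Python 3.8 or later for `math.prod`; it is an illustration, not the proof in Appendix B.

```python
import math

def lemma1_sides(mean_a, mean_b):
    """Return both sides of Lemma 1 when expectations factor into means.

    mean_a[k] and mean_b[k] are the means of w(a_k) and w(b_k).
    """
    K = len(mean_a)
    lhs = math.prod(mean_a) - math.prod(mean_b)
    rhs = 0.0
    for k in range(K):
        head = math.prod(mean_a[:k])      # product over i < k of E[w(a_i)]
        tail = math.prod(mean_b[k + 1:])  # product over j > k of E[w(b_j)]
        rhs += head * (mean_a[k] - mean_b[k]) * tail
    return lhs, rhs

lhs, rhs = lemma1_sides([0.9, 0.5, 0.3], [0.9, 0.4, 0.2])
print(lhs, rhs)
```

Positions where the two lists hold the same item contribute zero to the sum, which is how the proof of Theorem 1 discards the optimal items in $\mathbf{A}_t$.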
4.2. Upper Bounds 

In this section, we derive two upper bounds on the n-step 
regret of CascadeUCB1 and CascadeKL-UCB. 

Theorem 2. The expected n-step regret of CascadeUCB1 
is bounded as: 

$$R(n) \le \sum_{e=K+1}^{L} \frac{12}{\Delta_{e,K}} \log n + \frac{\pi^2}{3} L.$$

Proof. The complete proof is in Appendix A.1. The proof 
has four main steps. First, we bound the regret of the event 
that $\bar{w}(e)$ is outside of the high-probability confidence interval 
around $\hat{\mathbf{w}}_{\mathbf{T}_{t-1}(e)}(e)$ for at least one item $e$. Second, 
we decompose the regret at time $t$ and apply Theorem 1 to 
bound it from above. Third, we bound the number of times 
that each suboptimal item is chosen in $n$ steps. Fourth, we 
peel off an extra factor of $K$ in our upper bound based on 
Kveton et al. (2014a). Finally, we sum up the regret of all 
suboptimal items. ■ 

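Theorem 3. For any $\varepsilon > 0$, the expected n-step regret of 
CascadeKL-UCB is bounded as: 

$$R(n) \le \sum_{e=K+1}^{L} \frac{(1 + \varepsilon)\, \Delta_{e,K}\, (1 + \log(1/\Delta_{e,K}))}{D_{KL}(\bar{w}(e) \,\|\, \bar{w}(K))} (\log n + 3 \log \log n) + C,$$

where $C = K L \frac{C_2(\varepsilon)}{n^{\beta(\varepsilon)}} + 7 K \log \log n$, and the constants 
$C_2(\varepsilon)$ and $\beta(\varepsilon)$ are defined in Garivier & Cappe (2011). 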


Proof. The complete proof is in Appendix A.2. The proof 
has four main steps. First, we bound the regret of the event 
that $\bar{w}(e) > \mathbf{U}_t(e)$ for at least one optimal item $e$. Second, 
we decompose the regret at time $t$ and apply Theorem 1 to 
bound it from above. Third, we bound the number of times 
that each suboptimal item is chosen in $n$ steps. Fourth, we 
derive a new peeling argument for KL-UCB (Lemma 2) and 
eliminate an extra factor of $K$ in our upper bound. Finally, 
we sum up the regret of all suboptimal items. ■ 

4.3. Lower Bound 

Our lower bound is derived on the following problem. The 
ground set contains $L$ items $E = \{1, \dots, L\}$. The distribution 
$P$ is a product of $L$ Bernoulli distributions $P_e$, each of 
which is parameterized by: 

$$\bar{w}(e) = \begin{cases} p & e \le K \\ p - \Delta & \text{otherwise}, \end{cases} \tag{6}$$

where $\Delta \in (0, p)$ is the gap between any optimal and suboptimal 
items. We refer to the resulting bandit problem as 
$B_{LB}(L, K, p, \Delta)$; and parameterize it by $L$, $K$, $p$, and $\Delta$. 

Our lower bound holds for consistent algorithms. We say 
that the algorithm is consistent if for any cascading bandit, 
any suboptimal list $A$, and any $\alpha > 0$, $\mathbb{E}[\mathbf{T}_n(A)] = o(n^{\alpha})$, 
where $\mathbf{T}_n(A)$ is the number of times that list $A$ is recommended 
in $n$ steps. Note that the restriction to the consistent 
algorithms is without loss of generality. The reason is 
that any inconsistent algorithm must suffer polynomial regret 
on some instance of cascading bandits, and therefore 
cannot achieve logarithmic regret on every instance of our 
problem, similarly to CascadeUCB1 and CascadeKL-UCB. 

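Theorem 4. For any cascading bandit $B_{LB}$, the regret of 
any consistent algorithm is bounded from below as: 

$$\liminf_{n \to \infty} \frac{R(n)}{\log n} \ge \frac{(L - K)\, \Delta\, (1 - p)^{K-1}}{D_{KL}(p - \Delta \,\|\, p)}.$$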


Proof. By Theorem 1, the expected regret at time $t$ conditioned 
on history $\mathcal{H}_t$ is bounded from below as: 

$$\mathbb{E}_t[R(\mathbf{A}_t, \mathbf{w}_t)] \ge \Delta (1 - p)^{K-1} \sum_{e=K+1}^{L} \sum_{e^*=1}^{K} \mathbb{E}_t[1\{G_{e,e^*,t}\}].$$

Based on this result, the n-step regret is bounded as: 

$$R(n) \ge \Delta (1 - p)^{K-1} \sum_{e=K+1}^{L} \mathbb{E}\left[\sum_{t=1}^{n} \sum_{e^*=1}^{K} 1\{G_{e,e^*,t}\}\right] = \Delta (1 - p)^{K-1} \sum_{e=K+1}^{L} \mathbb{E}[\mathbf{T}_n(e)],$$

where the last step is based on the fact that the observation 
counter of item $e$ increases if and only if event $G_{e,e^*,t}$ happens. 
By the work of Lai & Robbins (1985), we have that 
for any suboptimal item $e$: 

$$\liminf_{n \to \infty} \frac{\mathbb{E}[\mathbf{T}_n(e)]}{\log n} \ge \frac{1}{D_{KL}(p - \Delta \,\|\, p)}.$$

Otherwise, the learning algorithm is unable to distinguish 
instances of our problem where item $e$ is optimal, and thus 
is not consistent. Finally, we chain all inequalities and get: 

$$\liminf_{n \to \infty} \frac{R(n)}{\log n} \ge \frac{(L - K)\, \Delta\, (1 - p)^{K-1}}{D_{KL}(p - \Delta \,\|\, p)}.$$

This concludes our proof. ■ 


Our lower bound is practical when no optimal item is very 
attractive, $p < 1/K$. In this case, the learning agent must 
learn $K$ sufficiently attractive items to identify the optimal 
solution. This lower bound is not practical when $p$ is close 
to 1, because it becomes exponentially small. In this case, 
other lower bounds would be more practical. For instance, 
consider a problem with $L$ items where item 1 is attractive 
with probability one and all other items are attractive with 
probability zero. The optimal list of $K$ items in this problem 
can be found in $L/(2K)$ steps in expectation. 
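
To see how quickly the bound degrades as $p$ grows, the coefficient of $\log n$ in Theorem 4 can be evaluated for a few instances of $B_{LB}(L, K, p, \Delta)$. The function names and parameter values below are illustrative assumptions; the sketch requires $0 < \Delta < p < 1$.

```python
import math

def bernoulli_kl(p, q):
    """KL divergence between Bernoulli(p) and Bernoulli(q), for 0 < p, q < 1."""
    return p * math.log(p / q) + (1.0 - p) * math.log((1.0 - p) / (1.0 - q))

def lower_bound_coefficient(L, K, p, gap):
    """Coefficient of log n in the lower bound of Theorem 4."""
    return (L - K) * gap * (1.0 - p) ** (K - 1) / bernoulli_kl(p - gap, p)

for p in (0.2, 0.5, 0.9):
    print(p, lower_bound_coefficient(L=16, K=8, p=p, gap=0.15))
```

The factor $(1 - p)^{K-1}$ dominates for large $p$, which is why the bound is informative mainly in the regime $p < 1/K$.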
4.4. Discussion 

We prove two gap-dependent upper bounds on the n-step 
regret of CascadeUCB1 (Theorem 2) and CascadeKL-UCB 
(Theorem 3). The bounds are $O(\log n)$, linear in the number 
of items $L$, and they improve as the number of recommended 
items $K$ increases. The bounds do not depend on 
the order of recommended items. This is due to the nature 
of our proofs, where we count events that ignore the positions 
of the items. We would like to extend our analysis in 
this direction in future work. 

We discuss the tightness of our upper bounds on problem 
$B_{LB}(L, K, p, \Delta)$ in Section 4.3 where we set $p = 1/K$. In 
this problem, Theorem 4 yields an asymptotic lower bound 
of: 

$$\Omega\left((L - K) \frac{\Delta}{D_{KL}(p - \Delta \,\|\, p)} \log n\right) \tag{7}$$

since $(1 - 1/K)^{K-1} \ge 1/e$ for $K > 1$. The n-step regret 
of CascadeUCB1 is bounded by Theorem 2 as: 

$$O\left((L - K) \frac{1}{\Delta} \log n\right) = O\left((L - K) \frac{\Delta}{\Delta^2} \log n\right) = O\left((L - K) \frac{\Delta}{p (1 - p)\, D_{KL}(p - \Delta \,\|\, p)} \log n\right) = O\left(K (L - K) \frac{\Delta}{D_{KL}(p - \Delta \,\|\, p)} \log n\right), \tag{8}$$

where the second equality is by $D_{KL}(p - \Delta \,\|\, p) \le \frac{\Delta^2}{p (1 - p)}$. 

The n-step regret of CascadeKL-UCB is bounded by Theorem 3 as: 

$$O\left((L - K) \frac{\Delta (1 + \log(1/\Delta))}{D_{KL}(p - \Delta \,\|\, p)} \log n\right) \tag{9}$$

and matches the lower bound in (7) up to $\log(1/\Delta)$. Note 
that the upper bound of CascadeKL-UCB (9) is below that 
of CascadeUCB1 (8) when $\log(1/\Delta) = O(K)$, or equivalently 
when $\Delta = \Omega(e^{-K})$. It is an open problem whether 
the factor of $\log(1/\Delta)$ in (9) can be eliminated. 
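
The inequality $D_{KL}(p - \Delta \,\|\, p) \le \Delta^2 / (p(1 - p))$ used in (8) can be checked numerically for $p = 1/K$. The sketch below uses a hypothetical $K$ and a few hypothetical gaps; it only illustrates the inequality and is not a proof.

```python
import math

def bernoulli_kl(p, q):
    """KL divergence between Bernoulli(p) and Bernoulli(q), for 0 < p, q < 1."""
    return p * math.log(p / q) + (1.0 - p) * math.log((1.0 - p) / (1.0 - q))

def kl_quadratic_bound(p, gap):
    """Right-hand side of the inequality used in (8)."""
    return gap ** 2 / (p * (1.0 - p))

K = 4            # hypothetical number of recommended items
p = 1.0 / K
for gap in (0.01, 0.05, 0.1, 0.2):
    kl = bernoulli_kl(p - gap, p)
    print(gap, kl, kl_quadratic_bound(p, gap), kl <= kl_quadratic_bound(p, gap))
```

With $p = 1/K$ the factor $1/(p(1 - p))$ equals $K^2/(K - 1)$, which is where the extra factor of $K$ in the last expression of (8) comes from.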

5. Experiments 

We conduct four experiments. In Section 5.1, we validate 
that the regret of our algorithms scales as suggested by our 
upper bounds (Section 4.2). In Section 5.2, we experiment 
with recommending items $\mathbf{A}_t$ in the opposite order, in 
increasing order of their UCBs. In Section 5.3, we show that 
CascadeKL-UCB performs robustly even when our modeling 
assumptions are violated. In Section 5.4, we compare 
CascadeKL-UCB to ranked bandits. 

5.1. Regret Bounds 

In the first experiment, we validate the qualitative behavior 
of our upper bounds (Section 4.2). We experiment with the 


 L    K    $\Delta$    CascadeUCB1       CascadeKL-UCB 
 16   2    0.15        1290.1 ± 11.3     357.9 ± 5.5 
 16   4    0.15         986.8 ± 10.8     275.1 ± 5.8 
 16   8    0.15         574.8 ± 7.9      149.1 ± 3.2 
 32   2    0.15        2695.9 ± 19.8     761.2 ± 10.4 
 32   4    0.15        2256.8 ± 12.8     633.2 ± 7.0 
 32   8    0.15        1581.0 ± 20.3     435.4 ± 5.7 
 16   2    0.075       2077.0 ± 32.9     766.0 ± 18.0 
 16   4    0.075       1520.4 ± 23.4     538.5 ± 12.5 
 16   8    0.075        725.4 ± 12.0     321.0 ± 16.3 

Table 1. The n-step regret of CascadeUCB1 and CascadeKL-UCB 
in $n = 10^5$ steps. The list $\mathbf{A}_t$ is ordered from the largest UCB to 
the smallest. All results are averaged over 20 runs. 


 L    K    $\Delta$    CascadeUCB1       CascadeKL-UCB 
 16   2    0.15        1160.2 ± 11.7     333.3 ± 6.1 
 16   4    0.15         660.0 ± 8.3      209.4 ± 4.4 
 16   8    0.15         181.4 ± 3.9       60.4 ± 2.0 
 32   2    0.15        2471.6 ± 14.1     716.0 ± 7.5 
 32   4    0.15        1615.3 ± 14.5     482.3 ± 6.7 
 32   8    0.15         595.0 ± 7.8      201.9 ± 5.8 
 16   2    0.075       1989.8 ± 31.4     785.8 ± 12.2 
 16   4    0.075       1239.5 ± 16.2     484.2 ± 12.5 
 16   8    0.075        336.4 ± 10.3     139.7 ± 6.6 

Table 2. The n-step regret of CascadeUCB1 and CascadeKL-UCB 
in $n = 10^5$ steps. The list $\mathbf{A}_t$ is ordered from the smallest UCB 
to the largest. All results are averaged over 20 runs. 

class of problems $B_{LB}(L, K, p, \Delta)$ in Section 4.3. We set 
$p = 0.2$; and vary $L$, $K$, and $\Delta$. The attraction probability 
$p$ is set such that it is close to $1/K$ for the maximum value 
of $K$ in our experiments. Our upper bounds are reasonably 
tight in this setting (Section 4.4), and we expect the regret 
of our methods to scale accordingly. We recommend items 
$\mathbf{A}_t$ in decreasing order of their UCBs. This order is motivated 
by the problem of web search, where higher ranked 
items are typically more attractive. We run CascadeUCB1 
and CascadeKL-UCB for $n = 10^5$ steps. 

Our results are reported in Table 1. We observe four major 
trends. First, the regret doubles when the number of items 
$L$ doubles. Second, the regret decreases when the number 
of recommended items $K$ increases. These trends are consistent 
with the fact that our upper bounds are $O(L - K)$. 
Third, the regret increases when $\Delta$ decreases. Finally, note 
that CascadeKL-UCB outperforms CascadeUCB1. This result 
is not particularly surprising. KL-UCB is known to outperform 
UCB1 when the expected payoffs of arms are low 
(Garivier & Cappe, 2011), because its confidence intervals 
get tighter as the Bernoulli parameters get closer to 0 or 1. 

5.2. Worst-of-Best First Item Ordering 

In the second experiment, we recommend items $\mathbf{A}_t$ in 
increasing order of their UCBs. This choice is not very natural 
and may be even dangerous. In practice, the user could 
get annoyed if highly ranked items were not attractive. On 
the other hand, the user would provide a lot of feedback on 
low quality items, which could speed up learning. We note 
that the reward in our model does not depend on the order 
of recommended items (Section 3.2). Therefore, the items 
can be ordered arbitrarily, perhaps to maximize feedback. 
In any case, we find it important to study the effect of this 
counterintuitive ordering, at least to demonstrate the effect 
of our modeling assumptions. 

The experimental setup is the same as in Section 5.1. Our 
results are reported in Table 2. When compared to Table 1, 
the regret of CascadeUCB1 and CascadeKL-UCB decreases 
for all settings of $K$, $L$, and $\Delta$; most prominently at large 
values of $K$. Our current analysis cannot explain this phenomenon 
and we leave it for future work. 

5.3. Imperfect Model 

The goal of this experiment is to evaluate CascadeKL-UCB 
in the setting where our modeling assumptions are not satisfied, 
to test its potential beyond our model. We generate 
data from the dynamic Bayesian network (DBN) model of 
Chapelle & Zhang (2009), a popular extension of the cascade 
model which is parameterized by attraction probabilities 
$\rho \in [0, 1]^E$, satisfaction probabilities $\nu \in [0, 1]^E$, and 
the persistence of users $\gamma \in (0, 1]$. In the DBN model, the 
user is recommended a list of $K$ items $A = (a_1, \dots, a_K)$ 
and examines it from the first recommended item $a_1$ to the 
last $a_K$. After the user examines item $a_k$, the item attracts 
the user with probability $\rho(a_k)$. When the user is attracted 
by the item, the user clicks on it and is satisfied with probability 
$\nu(a_k)$. If the user is satisfied, the user does not examine 
the remaining items. In any other case, the user examines 
item $a_{k+1}$ with probability $\gamma$. The reward is one if 
the user is satisfied with the list, and zero otherwise. Note 
that the reward is not observed. The regret is defined accordingly. 
The feedback consists of the clicks of the user. Note that the 
user can click on multiple items. 

The probability that at least one item in $A = (a_1, \dots, a_K)$ 
is satisfactory is: 

$$\sum_{k=1}^{K} \gamma^{k-1}\, w(a_k) \prod_{i=1}^{k-1} (1 - w(a_i)),$$

where $w(e) = \rho(e) \nu(e)$ is the probability that item $e$ satisfies 
the user after being examined. This objective is maximized 
by the list of $K$ items with largest weights $w(e)$ that 
are ordered in decreasing order of their weights. Note that 
the order matters. 
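
The effect of the ordering on the DBN objective can be illustrated with a short sketch. The function name, the weights, and the persistence values are hypothetical; the code simply evaluates the sum above.

```python
def dbn_satisfaction_probability(weights, ranked_list, gamma):
    """Probability that at least one item in the list satisfies the user.

    weights[e] is w(e), the probability that item e satisfies the user
    once examined; gamma is the persistence of the user.
    """
    total = 0.0
    reach = 1.0  # probability that the current position is examined
    for item in ranked_list:
        total += reach * weights[item]
        reach *= gamma * (1.0 - weights[item])
    return total

weights = {1: 0.5, 2: 0.2}  # hypothetical satisfaction weights
print(dbn_satisfaction_probability(weights, [1, 2], gamma=0.7))
print(dbn_satisfaction_probability(weights, [2, 1], gamma=0.7))
print(dbn_satisfaction_probability(weights, [2, 1], gamma=1.0))
```

With persistence below one, placing the heavier item first gives a strictly larger value; with persistence equal to one the two orders coincide, as in the cascade model.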

The above objective is similar to that in cascading bandits 
(Section 3). Therefore, it may seem that our learning algorithms 
(Section 3.2) can also learn the optimal solution to 
the DBN model. Unfortunately, this is not guaranteed. The 
reason is that not all clicks of the user are satisfactory. We 
illustrate this issue on a simple problem. Suppose that the 



Figure 1. The n-step regret of CascadeKL-UCB (solid lines) and 
RankedKL-UCB (dotted lines) in the DBN model in Section 5.3. 

user clicks on multiple items. Then only the last click can 
be satisfactory. But it does not have to be. For instance, it 
could have happened that the user was unsatisfied with the 
last click, and then scanned the recommended list until the 
end and left. 

We experiment on the class of problems $B_{LB}(L, K, p, \Delta)$ 
in Section 4.3 and modify it as follows. The ground set $E$ 
has $L = 16$ items and $K = 4$. The attraction probability of 
item $e$ is $\rho(e) = \bar{w}(e)$, where $\bar{w}(e)$ is given in (6). We set 
$\Delta = 0.15$. The satisfaction probabilities $\nu(e)$ of all items 
are the same. We experiment with two settings of $\nu(e)$, 1 
and 0.7; and with two settings of persistence $\gamma$, 1 and 0.7. 
We run CascadeKL-UCB for $n = 10^5$ steps and use the last 
click as an indicator that the user is satisfied with the item. 

Our results are reported in Figure 1. We observe in all 
experiments that the regret of CascadeKL-UCB flattens. This 
indicates that CascadeKL-UCB learns the optimal solution 
to the DBN model. An intuitive explanation for this result 
is that the exact values of $w(e)$ are not needed to perform 
well. Our current theory does not explain this phenomenon 
and we leave it for future work. 

5.4. Ranked Bandits 

In our final experiment, we compare CascadeKL-UCB to a 
ranked bandit (Section 6) where the base bandit algorithm 
is KL-UCB. We refer to this method as RankedKL-UCB. The 
choice of the base algorithm is motivated by the following 
reasons. First, KL-UCB is the best performing oracle in our 
experiments. Second, since both compared approaches use 
the same oracle, the difference in their regrets is likely due 
to their statistical efficiency, and not the oracle itself. 

The experimental setup is the same as in Section 5.3. Our 
results are reported in Figure 1. We observe that the regret 
of RankedKL-UCB is significantly larger than the regret of 
CascadeKL-UCB, about three times. The reason is that the 
regret in ranked bandits is $\Omega(K)$ (Section 6) and $K = 4$ in 
this experiment. The regret of our algorithms is $O(L - K)$ 
(Section 4.4). Note that CascadeKL-UCB is not guaranteed 
to be optimal in this experiment. Therefore, our results are 
encouraging and show that CascadeKL-UCB could be a viable 
alternative to more established approaches. 

6. Related Work 

Ranked bandits are a popular approach in learning to rank 
(Radlinski et al., 2008) and they are closely related to our 
work. The key characteristic of ranked bandits is that each 
position in the recommended list is an independent bandit 
problem, which is solved by some base bandit algorithm. 
The solutions in ranked bandits are $(1 - 1/e)$ approximate 
and the regret is $\Omega(K)$ (Radlinski et al., 2008), where $K$ is 
the number of recommended items. Cascading bandits can 
be viewed as a form of ranked bandits where each recommended 
item attracts the user independently. We propose 
novel algorithms for this setting that can learn the optimal 
solution and whose regret decreases with $K$. We compare 
one of our algorithms to ranked bandits in Section 5.4. 

Our learning problem is combinatorial in nature: our objective
is to learn the K most attractive items out of L. In this
sense, our work is related to stochastic combinatorial bandits,
which are often studied with linear rewards and semi-bandit
feedback (Gai et al., 2012; Kveton et al., 2014a;b;
2015). The key differences in our work are that the reward
function is non-linear in unknown parameters, and that the
feedback is less than semi-bandit: only a subset of the
recommended items is observed.
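
The two differences above can be illustrated with a minimal Python sketch. It assumes the cascade-model reward described in the introduction (the user is satisfied if at least one recommended item is attractive) and independent attraction probabilities. The function names are hypothetical and are not part of our algorithms.

```python
def cascade_reward(attraction, ranked_list):
    """Return 1 if at least one recommended item is attractive, else 0."""
    return int(any(attraction[e] for e in ranked_list))


def expected_reward(probs, ranked_list):
    """Expected reward 1 - prod_k (1 - w(a_k)); non-linear in the probabilities."""
    miss = 1.0
    for e in ranked_list:
        miss *= 1.0 - probs[e]
    return 1.0 - miss


def observed_prefix(attraction, ranked_list):
    """Items the user examines: everything up to and including the first click."""
    for k, e in enumerate(ranked_list):
        if attraction[e]:
            return list(ranked_list[: k + 1])
    return list(ranked_list)  # no click: the whole list was examined


probs = [0.5, 0.2, 0.1]
print(expected_reward(probs, [0, 1]))
print(observed_prefix([0, 1, 0, 1], [0, 1, 2, 3]))
```

In the second call, the items at positions 3 and 4 are never examined, so their weights stay unobserved in that round. This is the sense in which the feedback is weaker than semi-bandit feedback.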

Our reward function is non-linear in unknown parameters.
These types of problems have been studied before in various
contexts. Filippi et al. (2010) proposed and analyzed a
generalized linear bandit with bandit feedback. Chen et al.
(2013) studied a variant of stochastic combinatorial
semi-bandits whose reward function is a known monotone function
of a linear function in unknown parameters. Le et al.
(2014) studied a network optimization problem whose reward
function is a non-linear function of observations.

Bartok et al. (2012) studied finite partial monitoring problems.
This is a very general class of problems with finitely
many actions, which are chosen by the learning agent; and
finitely many outcomes, which are determined by the environment.
The outcome is unobserved and must be inferred
from the feedback of the environment. Cascading bandits
can be viewed as finite partial monitoring problems where
the actions are lists of K items out of L and the outcomes
are the corners of an L-dimensional binary hypercube.
Bartok et al. (2012) proposed an algorithm that can solve such
problems. This algorithm is computationally inefficient in
our problem because it needs to reason over all pairs of actions
and stores vectors of length $2^L$. Bartok et al. (2012)
also do not prove logarithmic distribution-dependent regret
bounds as in our work.
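
To give a sense of scale, the following Python sketch counts the actions, pairs of actions, and outcomes in this view of cascading bandits. The values L = 16 and K = 4 are illustrative assumptions, and the helper names are hypothetical.

```python
import math


def num_actions(L, K):
    """Number of ordered lists of K distinct items out of L: L! / (L - K)!."""
    return math.perm(L, K)


def num_outcomes(L):
    """Number of corners of the L-dimensional binary hypercube."""
    return 2 ** L


L, K = 16, 4  # illustrative problem size
actions = num_actions(L, K)
pairs = actions * (actions - 1) // 2  # unordered pairs of distinct actions
print(actions, pairs, num_outcomes(L))
```

Even at this small size, the number of action pairs is in the hundreds of millions, which is why reasoning over all pairs is impractical.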

Agrawal et al. (1989) studied a partial monitoring problem
with non-linear rewards. In this problem, the environment
draws a state from a distribution that depends on the action
of the learning agent and an unknown parameter. The form
of this dependency is known. The state of the environment
is observed and determines reward. The reward is a known
function of the state and action. Agrawal et al. (1989) also
proposed an algorithm for their problem and proved a
logarithmic distribution-dependent regret bound. Similarly to
Bartok et al. (2012), this algorithm is computationally
inefficient in our setting.

Lin et al. (2014) studied partial monitoring in combinatorial
bandits. The setting of this work is different from ours.
Lin et al. (2014) assume that the feedback is a linear function
of the weights of the items that is indexed by actions.
Our feedback is a non-linear function of the weights of the
items.

Mannor & Shamir (2011) and Caron et al. (2012) studied an
opposite setting to ours, where the learning agent observes 
a superset of chosen items. Chen et al. (2014) studied this 
problem in stochastic combinatorial semi-bandits. 

7. Conclusions 

In this paper, we propose a learning variant of the cascade
model (Craswell et al., 2008), a popular model of user
behavior in web search. We propose two algorithms for solving
it, CascadeUCB1 and CascadeKL-UCB, and prove
gap-dependent upper bounds on their regret. Our analysis
addresses two main challenges of our problem, a non-linear
reward function and limited feedback. We evaluate our
algorithms on several problems and show that they perform
well even when our modeling assumptions are violated.

We leave open several questions of interest. For instance,
we show in Section 5.3 that CascadeKL-UCB can learn the
optimal solution to the DBN model. This indicates that the
DBN model is learnable in the bandit setting and we leave
this for future work. Note that the regret in cascading bandits
is $\Omega(L)$ (Section 4.3). Therefore, our learning framework
is not practical when the number of items L is large.
Similarly to Slivkins et al. (2013), we plan to address this
issue by embedding the items in some feature space, along
the lines of Wen et al. (2015). Finally, we want to generalize
our results to more complex problems, such as learning
routing paths in computer networks where the connections
fail with unknown probabilities.

From the theoretical point of view, we would like to close
the gap between our upper and lower bounds. In addition,
we want to derive gap-free bounds. Finally, we would like
to refine our analysis so that it explains why the reverse
ordering of recommended items yields smaller regret.

A. Proofs of Main Theorems


A.1. Proof of Theorem 2


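Let $R_t = R(A_t, w_t)$ be the regret of the learning algorithm at time $t$, where $A_t$ is the recommended list at time $t$ and $w_t$
are the weights of items at time $t$. Let $\mathcal{E}_t = \{\exists e \in E \text{ s.t. } |\bar{w}(e) - \hat{w}_{T_{t-1}(e)}(e)| \geq c_{t-1, T_{t-1}(e)}\}$ be the event that $\bar{w}(e)$ is
not in the high-probability confidence interval around $\hat{w}_{T_{t-1}(e)}(e)$ for some $e$ at time $t$; and let $\overline{\mathcal{E}}_t$ be the complement of
$\mathcal{E}_t$, that is, $\bar{w}(e)$ is in the high-probability confidence interval around $\hat{w}_{T_{t-1}(e)}(e)$ for all $e$ at time $t$. Then we can decompose the
regret of CascadeUCB1 as:

$$R(n) = \mathbb{E}\left[\sum_{t=1}^{n} \mathbb{1}\{\mathcal{E}_t\} R_t\right] + \mathbb{E}\left[\sum_{t=1}^{n} \mathbb{1}\{\overline{\mathcal{E}}_t\} R_t\right]. \tag{10}$$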


Now we bound both terms in the above regret decomposition. 

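The first term in (10) is small because all of our confidence intervals hold with high probability. In particular, Hoeffding’s
inequality (Boucheron et al., 2013, Theorem 2.8) yields that for any $e$, $s$, and $t$:

$$P(|\bar{w}(e) - \hat{w}_s(e)| \geq c_{t,s}) \leq 2 \exp[-3 \log t],$$

and therefore:

$$\mathbb{E}\left[\sum_{t=1}^{n} \mathbb{1}\{\mathcal{E}_t\}\right]
\leq \sum_{e \in E} \sum_{t=1}^{n} \sum_{s=1}^{t} P(|\bar{w}(e) - \hat{w}_s(e)| \geq c_{t,s})
\leq 2 \sum_{e \in E} \sum_{t=1}^{n} \sum_{s=1}^{t} \exp[-3 \log t]
\leq 2 \sum_{e \in E} \sum_{t=1}^{n} t^{-2}
\leq \frac{\pi^2}{3} L.$$

Since $R_t \leq 1$, $\mathbb{E}\left[\sum_{t=1}^{n} \mathbb{1}\{\mathcal{E}_t\} R_t\right] \leq \frac{\pi^2}{3} L$.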
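
As a numerical sanity check of the last step, the following Python sketch evaluates the per-item double sum of $\exp[-3 \log t]$ over $s \leq t \leq n$ and compares it with $\pi^2 / 3$. It is an illustration only and plays no role in the proof; the function name is hypothetical.

```python
import math


def first_term_bound(n):
    """Per-item bound 2 * sum_{t=1}^n sum_{s=1}^t exp(-3 log t) = 2 * sum_t t^(-2)."""
    # The inner sum over s has t identical terms, so it equals t * exp(-3 log t).
    return 2.0 * sum(t * math.exp(-3.0 * math.log(t)) for t in range(1, n + 1))


for n in (10, 1000, 100000):
    print(n, first_term_bound(n), math.pi ** 2 / 3)
```

The partial sums increase with $n$ and stay below $\pi^2 / 3$, so summing over the $L$ items gives the bound above.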

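Recall that $\mathbb{E}_t[\cdot] = \mathbb{E}[\cdot \mid \mathcal{H}_t]$, where $\mathcal{H}_t$ is the history of the learning agent up to choosing $A_t$, the first $t - 1$ observations
and $t$ actions (4). Based on this definition, we rewrite the second term in (10) as:

$$\mathbb{E}\left[\sum_{t=1}^{n} \mathbb{1}\{\overline{\mathcal{E}}_t\} R_t\right]
\stackrel{(a)}{=} \sum_{t=1}^{n} \mathbb{E}\left[\mathbb{1}\{\overline{\mathcal{E}}_t\} \mathbb{E}_t[R_t]\right]
\stackrel{(b)}{\leq} \sum_{e=K+1}^{L} \mathbb{E}\left[\sum_{e^*=1}^{K} \sum_{t=1}^{n} \Delta_{e,e^*} \mathbb{1}\{\overline{\mathcal{E}}_t, G_{e,e^*,t}\}\right],$$

where equality (a) is due to the tower rule and that $\mathbb{1}\{\overline{\mathcal{E}}_t\}$ is only a function of $\mathcal{H}_t$, and inequality (b) is due to the upper
bound in Theorem 1.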

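Now we bound $\sum_{e^*=1}^{K} \sum_{t=1}^{n} \Delta_{e,e^*} \mathbb{1}\{\overline{\mathcal{E}}_t, G_{e,e^*,t}\}$ for any suboptimal item $e$. Select any optimal item $e^*$. When event $\overline{\mathcal{E}}_t$
happens, $|\bar{w}(e) - \hat{w}_{T_{t-1}(e)}(e)| < c_{t-1, T_{t-1}(e)}$. Moreover, when event $G_{e,e^*,t}$ happens, $U_t(e) \geq U_t(e^*)$ by Theorem 1.
Therefore, when both $G_{e,e^*,t}$ and $\overline{\mathcal{E}}_t$ happen:

$$\bar{w}(e) + 2 c_{t-1, T_{t-1}(e)} \geq U_t(e) \geq U_t(e^*) \geq \bar{w}(e^*),$$

which implies:

$$2 c_{t-1, T_{t-1}(e)} \geq \Delta_{e,e^*}.$$

Together with $c_{n, T_{t-1}(e)} \geq c_{t-1, T_{t-1}(e)}$, this implies $T_{t-1}(e) \leq \tau_{e,e^*}$, where $\tau_{e,e^*} = \frac{6}{\Delta_{e,e^*}^2} \log n$. Therefore:

$$\sum_{e^*=1}^{K} \sum_{t=1}^{n} \Delta_{e,e^*} \mathbb{1}\{\overline{\mathcal{E}}_t, G_{e,e^*,t}\}
\leq \sum_{e^*=1}^{K} \Delta_{e,e^*} \sum_{t=1}^{n} \mathbb{1}\{T_{t-1}(e) \leq \tau_{e,e^*}, G_{e,e^*,t}\}. \tag{11}$$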

Let:

$$M_{e,e^*} = \sum_{t=1}^{n} \mathbb{1}\{T_{t-1}(e) \leq \tau_{e,e^*}, G_{e,e^*,t}\}$$

be the inner sum in (11). Now note that (i) the counter $T_{t-1}(e)$ of item $e$ increases by one when the event $G_{e,e^*,t}$ happens
for any optimal item $e^*$, (ii) the event $G_{e,e^*,t}$ happens for at most one optimal $e^*$ at any time $t$; and (iii) $\tau_{e,1} \leq \dots \leq \tau_{e,K}$.
Based on these facts, it follows that $M_{e,e^*} \leq \tau_{e,e^*}$, and moreover $\sum_{e^*=1}^{K} M_{e,e^*} \leq \tau_{e,K}$. Therefore, the right-hand side of
(11) can be bounded from above by:

$$\max \left\{ \sum_{e^*=1}^{K} \Delta_{e,e^*} m_{e,e^*} : 0 \leq m_{e,e^*} \leq \tau_{e,e^*}, \ \sum_{e^*=1}^{K} m_{e,e^*} \leq \tau_{e,K} \right\}.$$

Since the gaps are decreasing, $\Delta_{e,1} \geq \dots \geq \Delta_{e,K}$, the solution to the above problem is $m^*_{e,1} = \tau_{e,1}$, $m^*_{e,2} = \tau_{e,2} - \tau_{e,1}$,
..., $m^*_{e,K} = \tau_{e,K} - \tau_{e,K-1}$. Therefore, the value of (11) is bounded from above by:


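$$\left[ \Delta_{e,1} \frac{1}{\Delta_{e,1}^2} + \sum_{e^*=2}^{K} \Delta_{e,e^*} \left( \frac{1}{\Delta_{e,e^*}^2} - \frac{1}{\Delta_{e,e^*-1}^2} \right) \right] 6 \log n.$$

By Lemma 3 of Kveton et al. (2014a), the above term is bounded by $\frac{12}{\Delta_{e,K}} \log n$. Finally, we chain all inequalities and sum
over all suboptimal items $e$.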
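
The last two steps can be illustrated numerically. The Python sketch below builds the allocation $m^*_{e,1}, \dots, m^*_{e,K}$ from the thresholds and evaluates the bracketed constant for a decreasing sequence of gaps, comparing it with $2 / \Delta_{e,K}$. The gap values and function names are illustrative assumptions, not quantities from our experiments.

```python
def theorem2_bracket(gaps):
    """Bracketed constant for gaps Delta_{e,1} >= ... >= Delta_{e,K} > 0."""
    total = gaps[0] / gaps[0] ** 2
    for prev, cur in zip(gaps, gaps[1:]):
        total += cur * (1.0 / cur ** 2 - 1.0 / prev ** 2)
    return total


def greedy_allocation(taus):
    """m*_1 = tau_1 and m*_k = tau_k - tau_{k-1} for nondecreasing thresholds."""
    return [taus[0]] + [b - a for a, b in zip(taus, taus[1:])]


gaps = [0.5, 0.3, 0.2, 0.1]            # illustrative decreasing gaps
taus = [1.0 / g ** 2 for g in gaps]    # thresholds up to the common factor 6 log n
alloc = greedy_allocation(taus)
value = sum(g * m for g, m in zip(gaps, alloc))
print(value, theorem2_bracket(gaps), 2.0 / gaps[-1])
```

The objective value at the allocation coincides with the bracketed constant, and both stay below $2 / \Delta_{e,K}$. Multiplying by $6 \log n$ recovers the per-item bound.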


A.2. Proof of Theorem 3 

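Let $R_t = R(A_t, w_t)$ be the regret of the learning algorithm at time $t$, where $A_t$ is the recommended list at time $t$ and $w_t$
are the weights of items at time $t$. Let $\mathcal{E}_t = \{\exists\, 1 \leq e \leq K \text{ s.t. } \bar{w}(e) > U_t(e)\}$ be the event that the attraction probability
of at least one optimal item is above its upper confidence bound at time $t$. Let $\overline{\mathcal{E}}_t$ be the complement of event $\mathcal{E}_t$. Then we
can decompose the regret of CascadeKL-UCB as:

$$R(n) = \mathbb{E}\left[\sum_{t=1}^{n} \mathbb{1}\{\mathcal{E}_t\} R_t\right] + \mathbb{E}\left[\sum_{t=1}^{n} \mathbb{1}\{\overline{\mathcal{E}}_t\} R_t\right]. \tag{12}$$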


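By Theorems 2 and 10 of Garivier & Cappe (2011), thanks to the choice of the upper confidence bound $U_t$, the first term
in (12) is bounded as $\mathbb{E}\left[\sum_{t=1}^{n} \mathbb{1}\{\mathcal{E}_t\} R_t\right] \leq 7 K \log \log n$. As in the proof of Theorem 2, we rewrite the second term as:

$$\mathbb{E}\left[\sum_{t=1}^{n} \mathbb{1}\{\overline{\mathcal{E}}_t\} R_t\right]
= \sum_{t=1}^{n} \mathbb{E}\left[\mathbb{1}\{\overline{\mathcal{E}}_t\} \mathbb{E}_t[R_t]\right]
\leq \sum_{e=K+1}^{L} \mathbb{E}\left[\sum_{e^*=1}^{K} \sum_{t=1}^{n} \Delta_{e,e^*} \mathbb{1}\{\overline{\mathcal{E}}_t, G_{e,e^*,t}\}\right].$$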

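Now note that for any suboptimal item $e$ and $\tau_{e,e^*} > 0$:

$$\begin{aligned}
\mathbb{E}\left[\sum_{e^*=1}^{K} \sum_{t=1}^{n} \Delta_{e,e^*} \mathbb{1}\{\overline{\mathcal{E}}_t, G_{e,e^*,t}\}\right]
\leq{} & \mathbb{E}\left[\sum_{e^*=1}^{K} \sum_{t=1}^{n} \Delta_{e,e^*} \mathbb{1}\{T_{t-1}(e) \leq \tau_{e,e^*}, G_{e,e^*,t}\}\right] \\
& + \sum_{e^*=1}^{K} \Delta_{e,e^*} \mathbb{E}\left[\sum_{t=1}^{n} \mathbb{1}\{T_{t-1}(e) > \tau_{e,e^*}, \overline{\mathcal{E}}_t, G_{e,e^*,t}\}\right].
\end{aligned} \tag{13}$$

Let:

$$\tau_{e,e^*} = \frac{1 + \varepsilon}{D_{\mathrm{KL}}(\bar{w}(e) \,\|\, \bar{w}(e^*))} (\log n + 3 \log \log n).$$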

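Then by the same argument as in Theorem 2 and Lemma 8 of Garivier & Cappe (2011):

$$\mathbb{E}\left[\sum_{t=1}^{n} \mathbb{1}\{T_{t-1}(e) > \tau_{e,e^*}, \overline{\mathcal{E}}_t, G_{e,e^*,t}\}\right] \leq \frac{C_2(\varepsilon)}{n^{\beta(\varepsilon)}}$$

holds for any suboptimal $e$ and optimal $e^*$, where $C_2(\varepsilon)$ and $\beta(\varepsilon)$ are the constants of Garivier & Cappe (2011). Since
$\Delta_{e,e^*} \leq 1$, the second term in (13) is bounded from above by $K \frac{C_2(\varepsilon)}{n^{\beta(\varepsilon)}}$. Now we bound
the first term in (13). By the same argument as in the proof of Theorem 2: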


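$$\sum_{e^*=1}^{K} \sum_{t=1}^{n} \Delta_{e,e^*} \mathbb{1}\{T_{t-1}(e) \leq \tau_{e,e^*}, G_{e,e^*,t}\}
\leq \left[ \frac{\Delta_{e,1}}{D_{\mathrm{KL}}(\bar{w}(e) \,\|\, \bar{w}(1))} + \sum_{e^*=2}^{K} \Delta_{e,e^*} \left( \frac{1}{D_{\mathrm{KL}}(\bar{w}(e) \,\|\, \bar{w}(e^*))} - \frac{1}{D_{\mathrm{KL}}(\bar{w}(e) \,\|\, \bar{w}(e^* - 1))} \right) \right] (1 + \varepsilon)(\log n + 3 \log \log n)$$

holds for any suboptimal item $e$. By Lemma 2, the leading constant is bounded as:

$$\frac{\Delta_{e,1}}{D_{\mathrm{KL}}(\bar{w}(e) \,\|\, \bar{w}(1))} + \sum_{e^*=2}^{K} \Delta_{e,e^*} \left( \frac{1}{D_{\mathrm{KL}}(\bar{w}(e) \,\|\, \bar{w}(e^*))} - \frac{1}{D_{\mathrm{KL}}(\bar{w}(e) \,\|\, \bar{w}(e^* - 1))} \right)
\leq \frac{\Delta_{e,K} (1 + \log(1 / \Delta_{e,K}))}{D_{\mathrm{KL}}(\bar{w}(e) \,\|\, \bar{w}(K))}.$$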

Finally, we chain all inequalities and sum over all suboptimal items e. 
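
To make the role of the KL divergence concrete, the following Python sketch compares the KL-based threshold on the number of observations with the threshold $6 \log n / \Delta_{e,e^*}^2$ from the proof of Theorem 2, for Bernoulli attraction probabilities. The probabilities, the horizon, the value of $\varepsilon$, and the function names are illustrative assumptions.

```python
import math


def bernoulli_kl(p, q):
    """KL divergence between Bernoulli(p) and Bernoulli(q), for 0 < p, q < 1."""
    return p * math.log(p / q) + (1.0 - p) * math.log((1.0 - p) / (1.0 - q))


def tau_kl(w_e, w_star, n, eps):
    """(1 + eps) / D_KL(w_e || w_star) * (log n + 3 log log n)."""
    log_terms = math.log(n) + 3.0 * math.log(math.log(n))
    return (1.0 + eps) / bernoulli_kl(w_e, w_star) * log_terms


def tau_ucb1(w_e, w_star, n):
    """6 log n / gap^2, the threshold from the proof of Theorem 2."""
    return 6.0 * math.log(n) / (w_star - w_e) ** 2


n = 100_000  # illustrative horizon
print(tau_kl(0.1, 0.2, n, eps=0.1), tau_ucb1(0.1, 0.2, n))
```

For these values the KL-based threshold is much smaller. This is consistent with Pinsker's inequality, $D_{\mathrm{KL}}(p \,\|\, q) \geq 2 (p - q)^2$, which the test values also satisfy.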


B. Technical Lemmas 

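Lemma 1. Let $A = (a_1, \dots, a_K)$ and $B = (b_1, \dots, b_K)$ be any two lists of $K$ items from $\Pi_K(E)$ such that $a_i = b_j$ only
if $i = j$. Let $w \sim P$ in Assumption 1. Then:

$$\mathbb{E}\left[\prod_{k=1}^{K} w(a_k) - \prod_{k=1}^{K} w(b_k)\right]
= \sum_{k=1}^{K} \mathbb{E}\left[\prod_{i=1}^{k-1} w(a_i)\right] \mathbb{E}[w(a_k) - w(b_k)] \left(\prod_{j=k+1}^{K} \mathbb{E}[w(b_j)]\right).$$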


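Proof. First, we prove that:

$$\prod_{k=1}^{K} w(a_k) - \prod_{k=1}^{K} w(b_k)
= \sum_{k=1}^{K} \left(\prod_{i=1}^{k-1} w(a_i)\right) (w(a_k) - w(b_k)) \left(\prod_{j=k+1}^{K} w(b_j)\right)$$

holds for any $w \in \{0, 1\}^E$. The proof is by induction on $K$. The claim holds obviously for $K = 1$. Now suppose that the
claim holds for any $A, B \in \Pi_{K-1}(E)$. Let $A, B \in \Pi_K(E)$. Then:

$$\begin{aligned}
\prod_{k=1}^{K} w(a_k) - \prod_{k=1}^{K} w(b_k)
&= \prod_{k=1}^{K} w(a_k) - w(b_K) \prod_{k=1}^{K-1} w(a_k) + w(b_K) \prod_{k=1}^{K-1} w(a_k) - \prod_{k=1}^{K} w(b_k) \\
&= (w(a_K) - w(b_K)) \prod_{k=1}^{K-1} w(a_k) + w(b_K) \left[\prod_{k=1}^{K-1} w(a_k) - \prod_{k=1}^{K-1} w(b_k)\right] \\
&= (w(a_K) - w(b_K)) \prod_{k=1}^{K-1} w(a_k) + \sum_{k=1}^{K-1} \left(\prod_{i=1}^{k-1} w(a_i)\right) (w(a_k) - w(b_k)) \left(\prod_{j=k+1}^{K} w(b_j)\right) \\
&= \sum_{k=1}^{K} \left(\prod_{i=1}^{k-1} w(a_i)\right) (w(a_k) - w(b_k)) \left(\prod_{j=k+1}^{K} w(b_j)\right).
\end{aligned}$$

The third equality is by our induction hypothesis. Finally, note that $w$ is drawn from a factored distribution. Therefore, we
can decompose the expectation of the product as a product of expectations, and our claim follows. ■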
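
The identity in the proof of Lemma 1 can be verified by brute force on a small instance. The Python sketch below enumerates all binary weight vectors over five items and compares both sides for two lists that share an item only at the same position. The lists and function names are illustrative assumptions.

```python
import itertools
import math


def product_gap(w, a, b):
    """Left-hand side: prod_k w(a_k) - prod_k w(b_k)."""
    return math.prod(w[e] for e in a) - math.prod(w[e] for e in b)


def telescoped_gap(w, a, b):
    """Right-hand side: sum over k of prefix over A, difference at k, suffix over B."""
    total = 0
    for k in range(len(a)):
        prefix = math.prod(w[e] for e in a[:k])      # empty product is 1
        suffix = math.prod(w[e] for e in b[k + 1:])  # empty product is 1
        total += prefix * (w[a[k]] - w[b[k]]) * suffix
    return total


a, b = (0, 1, 2), (3, 1, 4)  # a_i == b_j only when i == j
for w in itertools.product((0, 1), repeat=5):
    assert product_gap(w, a, b) == telescoped_gap(w, a, b)
print("identity holds on all", 2 ** 5, "binary weight vectors")
```

The same functions can be called with the mean weights in place of `w`, which corresponds to the statement of the lemma once the expectation factorizes over distinct items.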

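Lemma 2. Let $p_1 \geq \dots \geq p_K > p$ be $K + 1$ probabilities and $\Delta_k = p_k - p$ for $1 \leq k \leq K$. Then:

$$\frac{\Delta_1}{D_{\mathrm{KL}}(p \,\|\, p_1)} + \sum_{k=2}^{K} \Delta_k \left( \frac{1}{D_{\mathrm{KL}}(p \,\|\, p_k)} - \frac{1}{D_{\mathrm{KL}}(p \,\|\, p_{k-1})} \right)
\leq \frac{\Delta_K (1 + \log(1 / \Delta_K))}{D_{\mathrm{KL}}(p \,\|\, p_K)}.$$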

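Proof. First, we note that:

$$\frac{\Delta_1}{D_{\mathrm{KL}}(p \,\|\, p_1)} + \sum_{k=2}^{K} \Delta_k \left( \frac{1}{D_{\mathrm{KL}}(p \,\|\, p_k)} - \frac{1}{D_{\mathrm{KL}}(p \,\|\, p_{k-1})} \right)
= \sum_{k=1}^{K-1} \frac{\Delta_k - \Delta_{k+1}}{D_{\mathrm{KL}}(p \,\|\, p_k)} + \frac{\Delta_K}{D_{\mathrm{KL}}(p \,\|\, p_K)}.$$

The summation over $k$ can be bounded from above by a definite integral:

$$\sum_{k=1}^{K-1} \frac{\Delta_k - \Delta_{k+1}}{D_{\mathrm{KL}}(p \,\|\, p_k)}
= \sum_{k=1}^{K-1} \frac{\Delta_k - \Delta_{k+1}}{D_{\mathrm{KL}}(p \,\|\, p + \Delta_k)}
\leq \int_{\Delta_K}^{\Delta_1} \frac{1}{D_{\mathrm{KL}}(p \,\|\, p + x)} \, dx,$$

where the inequality follows from the fact that $1 / D_{\mathrm{KL}}(p \,\|\, p + x)$ decreases on $x \geq 0$. To the best of our knowledge,
the integral of $1 / D_{\mathrm{KL}}(p \,\|\, p + x)$ over $x$ does not have a simple analytic solution. Therefore, we integrate an upper bound
on $1 / D_{\mathrm{KL}}(p \,\|\, p + x)$ which does. In particular, note that for any $x \geq \Delta_K$:

$$D_{\mathrm{KL}}(p \,\|\, p + x) \geq \frac{D_{\mathrm{KL}}(p \,\|\, p + \Delta_K)}{\Delta_K} x = \frac{D_{\mathrm{KL}}(p \,\|\, p_K)}{\Delta_K} x$$

because $D_{\mathrm{KL}}(p \,\|\, p + x)$ is convex, increasing in $x \geq 0$, and its minimum is attained at $x = 0$. Therefore:

$$\int_{\Delta_K}^{\Delta_1} \frac{1}{D_{\mathrm{KL}}(p \,\|\, p + x)} \, dx
\leq \frac{\Delta_K}{D_{\mathrm{KL}}(p \,\|\, p_K)} \int_{\Delta_K}^{\Delta_1} \frac{1}{x} \, dx
\leq \frac{\Delta_K \log(1 / \Delta_K)}{D_{\mathrm{KL}}(p \,\|\, p_K)}.$$

Finally, we chain all inequalities and get the final result. ■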
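
Lemma 2 can be checked numerically on a small example. The Python sketch below evaluates both sides of the inequality for Bernoulli distributions. The probabilities and function names are illustrative assumptions, and the check is not a substitute for the proof.

```python
import math


def bernoulli_kl(p, q):
    """KL divergence between Bernoulli(p) and Bernoulli(q), for 0 < p, q < 1."""
    return p * math.log(p / q) + (1.0 - p) * math.log((1.0 - p) / (1.0 - q))


def lemma2_lhs(p, ps):
    """Left-hand side for ps = [p_1, ..., p_K], nonincreasing, all greater than p."""
    gaps = [q - p for q in ps]
    total = gaps[0] / bernoulli_kl(p, ps[0])
    for k in range(1, len(ps)):
        total += gaps[k] * (1.0 / bernoulli_kl(p, ps[k]) - 1.0 / bernoulli_kl(p, ps[k - 1]))
    return total


def lemma2_rhs(p, ps):
    """Right-hand side: Delta_K * (1 + log(1 / Delta_K)) / D_KL(p || p_K)."""
    gap = ps[-1] - p
    return gap * (1.0 + math.log(1.0 / gap)) / bernoulli_kl(p, ps[-1])


p, ps = 0.1, [0.6, 0.4, 0.3, 0.2]  # illustrative probabilities
print(lemma2_lhs(p, ps), lemma2_rhs(p, ps))
```

On this instance the left-hand side is well below the right-hand side, which suggests that the bound leaves some slack when the gaps are spread out.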




